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Abstract 

The finite element method (FEM) has several computational steps to numeri¬ 
cally solve a particular problem, to which many efforts have been directed to 
accelerate the solution stage of the linear system of equations. However, the fi¬ 
nite element matrix construction, which is also time-consuming for unstructured 
meshes, has been less investigated. The generation of the global finite element 
matrix is performed in two steps, computing the local matrices by numerical 
integration and assembling them into a global system, which has traditionally 
been done in serial computing. This work presents a fast technique to construct 
the global finite element matrix that arises by solving the Poisson’s equation 
in a three-dimensional domain. The proposed methodology consists in comput¬ 
ing the numerical integration, due to its intrinsic parallel opportunities, in the 
graphics processing unit (GPU) and computing the matrix assembly, due to its 
intrinsic serial operations, in the central processing unit (CPU). In the numer¬ 
ical integration, only the lower triangular part of each local stiffness matrix is 
computed thanks to its symmetry, which saves GPU memory and computing 
time. As a result of symmetry, the global sparse matrix also contains non-zero 
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elements only in its lower triangular part, which reduces the assembly operations 
and memory usage. This methodology allows generating the global sparse ma¬ 
trix from any unstructured finite element mesh size on GPUs with little memory 
capacity, only limited by the CPU memory. 

Keywords: Finite element method (FEM), unstructured mesh, matrix 
generation, parallel computing, graphic processing unit (GPU), heterogeneous 
computing 


1. Introduction 

Numerous physical phenomena in a stationary condition such as electri¬ 
cal and magnetic potential, heat conduction, fluid flow and elastic problems 
in a static condition can be described by elliptic partial differential equations 
(EPDEs). The EPDEs does not involve a time variable, and so describes the 
steady state response. A linear EPDE has the general form given in Eq. Q, 
where a, 6, c are coefficient functions, / is a source (excitation) function and u 
is the unknown variable. All of these functions can vary spatially (x, y, z). 

\7{c ■ Vu) + b ■ Vu + au = f (1) 

EPDEs can be solved exactly by mathematical procedures such as Fourier 
series [T]. However, the classical solution does not frequently exist; for those 
problems which allow the use of these analytical methods, many simplifications 
are made [2]. Consequently, several numerical methods have been developed 
such as the finite element method (FEM) and finite difference method to effi¬ 
ciently solve EPDEs. 

The FEM has several advantages over other methods. The main advantage 
is that it is particularly useful for problems with complicated geometry using 
unstructured meshes [5] . One way to get a suitable framework for solving EPDEs 
problems, by using FEM, is formulating them as variational problems, also called 
weak solution. 


2 



The variational formulation of an EPDE is a mathematical treatment for 
converting the strong formulation into a weak formulation, which permits the 
approximation in elements or subdomains, and the EPDE variational form is 
solved using the so-called Galerking method 0 - Fig. g summarizes the steps 
required to numerically solve an EPDE by using FEM. Some projects are pub¬ 
licly available and all of these steps are solved automatically in the computer 
[3]; however, the most common commercial FEM packages are focused on steps 
4-8, which are: 

(i) Domain discretization with finite elements (EEs); 

(ii) Numerical integration of all EEs. In this step, a local matrix fee and a 
local load vector are computed for each finite element; 

(iii) Construction of the global sparse matrix K from local matrices fee and 
global load vector F from local load vectors 

(iv) Application of boundary conditions; 

(v) Solution of the linear equation system formed previously, KU = F, where 
U represents the vector of the nodal solution. 


1. Obtaining the 
PDE for modelling 
the physical 
phenomena 

— ► 

2. Formulating a 
weak problem from 
the PDE 

-► 

3. Solving the weak 
problem with 
Galerkin method 

-► 

4. Domain 

discretization with 

finite elements 


I 


8. Solution of the 
linear equation 
system 

< - 

7. Application of 
boundary condition 

< 

6. Construction of 
the global sparse 
matrix and global 
load vector 

<— 

5. Numerical 
integration at each 
finite element 


Figure 1: FEM steps to solve EPDEs. 


Many efforts have been made to accelerate the linear equation solver by using 
direct [5] and iterative methods [5]; however, the FE sparse matrix construction 
(steps 5 and 6 of Fig. [^, which are also time-consuming for dense unstructured 
meshes, have been less investigated [7]. Hence, in this work the fast construction 
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of the global sparse matrix is developed, which arises in the numerical solution 
of EPDEs by EEM. 

One way to efficiently construct the global FE matrix is by using parallel 
computing. Parallelism is a trend in current computing and, for this reason, 
microprocessor manufacturers focus on adding cores rather than on increasing 
single-thread performance [8]. Although there are several parallel architectures, 
the most widespread many-core architecture are the GPUs (graphics process¬ 
ing units), and the most common multi-core architecture are the CPUs (central 
processing units) |9]. However, due to the massive parallelism offered by GPU 
compared to CPU, the trend in scientific computing is to use GPU for acceler¬ 
ating intensive and intrinsic parallel computations [TU]. Additionally, the GPU 
is preferred over the other architectures to be used as a co-processor to ac¬ 
celerate part or the whole code because its market is well established in the 
computer games industry. This fact allows having a graphic processor in most 
personal computers, which makes GPUs an economic and attractive option for 
application developers [10]. 

Although some researches are focused on completely porting the EEM sub¬ 
routines to be executed in GPU [nmn, other works are focused on porting parts 
of the EEM code to the GPU, such as numerical integration dl [H EH ED, 
matrix assembly [niEii, and the solution of large linear systems of equations 
m EOi mi. Even though some previous works accelerated the construction 
of sparse FE matrix [3 mi US] , these approaches require high GPU memory, 
which means that the size of the problem that may be analyzed is limited or an 
expensive GPU with high memory capacity is needed. However, in most per¬ 
sonal computers equipped with a graphics processor, the GPU has low-memory 
capacity. 

Accordingly, the proposed approach is based on the assumption that a sin¬ 
gle GPU has significantly less memory than the GPU; therefore, a GPU-GPU 
implementation is developed. The central idea is computing the numerical inte¬ 
gration in the GPU, which is an intensive and intrinsic parallel subroutine, and 
computing the assembly in the CPU, which is a predominantly serial subroutine 
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with high memory consumption. Hence, based on this methodology, we are able 
to construct any large sparse matrix arising in FEM, only limited by the CPU 
memory. 

The rest of the paper is organized as follows. In section the FEM formu¬ 
lation of the Poisson’s equation used as an example for checking the proposed 
implementation is given. Section presents the CPU-GPU implementation of 
the FE matrix construction. The results are discussed in section and finally, 
in section the conclusions are drawn. 

2. Elliptic PDE modelling by FEM 

The computational benefits of our proposed implementation in the construc¬ 
tion of large sparse FE matrices is shown by solving the Poisson’s equation in a 
three-dimensional (3D) domain. The Poisson’s equation is an elliptic-type PDE. 
This type of equation is commonly used to solve scalar problems such as electri¬ 
cal and magnetic potential, heat transfer and other problems with no external 
flows [3]. In addition, the Poisson’s equation is used for modelling vector field 
problems such as structural problems without external applied forces. All of 
these problems should be in static or stationary condition. In this section, the 
FEM steps are presented, from the strong formulation step to the construction 
of the global sparse matrix step (steps 1 to 6 of Fig. [^. 

The strong formulation of the Poisson equation is given by [3]: 

V • (cV<))(r)) = 0 in D 

(j) = 4> on F^ (2) 

= 0 on F, 

where (j) represents a general scalar variable on the domain H, i.e. can be the 
electrical potential or temperature, depending on the specific applications. The 
position vector is r and the term c can represent the electrical or thermal conduc¬ 
tivity for an isotropic material. Term 0 is a prescribed value of scalar variable (j) 
at boundary F^ (Dirichlet condition) and the normal component of the flow (or 
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flux) Qn is stated for simplicity as zero at boundary (Neumann condition) for 
the purposes of computational evaluation only. After the integration by parts 
of Eq. ([^, the weak form is found as: 

I {S/iyf{cV(j))dn = 0 (3) 

Jn 

where v is an arbitrary function |3] . Based on the weak form of the Poisson equa¬ 
tion, the FE formulation can be found. Accordingly, the domain is discretized 
into EEs; next, in each FE, the scalar variable is approximated as: 

n 

= (4) 

a—1 

where n is the number of nodes in the element and, according to the Galerkin 
principle, v = Na are the shape functions. Substituting Eq. 0 into Eq. 0 
and evaluating the integrals for all elements, a system of equation is obtained 
as: 


= F; F = 0 (5) 

where K is the so-called global sparse matrix or stiffness matrix. Term $ is 
the nodal solution vector and global load vector F is modified by the boundary 
conditions, ceasing to be a vector of zeroes. The K matrix is obtained as a sum 
of contributions of the matrices obtained in numerical integration for each FE. 
This step is known as assembly [3]: 

nel 

K = Y.K ( 6 ) 

e=l 

where nel is the number of EEs and fcg is the local matrix of element e (local 
stiffness matrix), which is obtained as: 

P n 

fee = / cB^Bdn, with B = V ViVa (7) 

In our approach we consider that coefficient c, as a part of material prop¬ 
erties array (MP), can vary from one element to the other, but it remains 
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constant for a specific element, which means it can be outside the integral in 
Eq. Q . The global sparse matrix construction is made in Eq. @ and Eq. Q. 
However, for computational efficiency, Eq. 0 is solved numerically instead of 
by exact integration. There are several numerical procedures to solve an inte¬ 
gral, but the most common in EEM is the Gauss-Legendre, or simply Gauss 
quadrature [3]. For using the Gauss quadrature, a transformation by using 
the reference element technique must be applied, in which physical coordinates 
{x,y,z) are transformed into natural coordinates (r, s,t). The equivalence be¬ 
tween these systems of coordinates in the local matrix calculation (see the Eq. 
0 ) is represented by: 


ke = cj B^{x,y,z)B{x,y,z)dxdydz 
111 


( 8 ) 


-1 -1 -1 

where J is the Jacobian matrix and term | J| is its determinant. After that, the 
Gauss quadrature is applied to the normalize domain (right hand side of Eq. 
([^) as: 


M, M,- Mk 

^e = cEEE^" (^2 ; 5 B , Sj , t/,;) I t/(r2 , Sj , t/,;) W (9) 

2—1 j — 1 k—1 

where Mi, Mj,Mk, rt, Sj,tk and Wi, Wj,Wk are the number, points and weights 
of the integration points in each direction in a 3D domain. In our implementa¬ 
tion, we use hexahedral EEs with eight nodes in the discretization of the domain, 
also known as 8-node brick elements which is a low-order tri-linear FE. The 8- 
node brick element was selected based on the idea that the majority of FE codes 
are based on the low-order elements and few works have focused on this type of 
elements |13j . The points and weights for an exact numerical integration of the 
8-node brick element can be consulted in . 
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3. Numerical implementation 


A typical way to construct the global sparse FE matrix is using serial com¬ 
puting. There are several ways to construct sparse matrices; however, the fastest 
method is to create the triplelQform first, and then, using a sparse function to 
compress the triplet sparse format, avoiding some memory challenges [21]• De¬ 
spite the triplet format being simple to create, it is difficult to use in most sparse 
matrix algorithms; a compress sparse format is thus required |5]. A common 
compress sparse format is the CSC (compressed sparse column), used by default 
in MATLAB [25]. 

Algorithm[^shows a typical implementation for the sparse matrix generation 
using the MATLAB syntax. This algorithm needs a subroutine that makes the 
numerical integration for computing local matrix fee ^^nd local load vector f^. 
However, this work is focused only on generating matrix K; load vector F is 
not considered for parallel computation (this term is not considered because it 
is computationally much less demanding [T51 [H]). The numerical integration 
subroutine changes with the PDE, element type and basis functions; it is a 
function of the nodal coordinates (C) and any nodal fields such as forces or 
boundary conditions [BCs) and material properties (MP) of element e [17]. 
After computing local matrix kg, which is stored in array K^u, a mapping from 
local to global degrees of freedom (DOFs) is required. The mapping is made 
considering connectivity matrix {E), which gives the global node numbers that 
compose an element, and the number of DOFs per node (DOFxn). As a result 
of the mapping function, the positions in the global sparse matrix of each entry 
of fee is obtained, i.e. row (Indrow) and column [Indgoi) indices, which are 

^In the triplet sparse format the row and column indices that specify the position of NNZ 
(non-zero) entries must be stored. This means that each NNZ entry requires three numbers: 
the row and column position using integer numbers and the value of that entry using a double 
precision number. In this format, several NNZ entries with the same row/column indices can 
be repeated, which makes the difference with the coordinate (COO) sparse matrix format 
where entries with same row/column indices are summed. 



stored in arrays K^ows and Kcois, respectively. Finally, the global sparse matrix 
is obtained in a single call of an assembly function (e.g. the MATLAB sparse 
function [SUM]). 


Algorithm 1 Global sparse FE matrix construction in serial computing. 
1 ; Initialize rows^ ^^cois and K^yais fo zero, 

2: for each element e in nel do 

3: fee < — NumericalIntegration{C, MP, BCs); 

4; .) fee(.), 

5 : {Indrow,Indcoi) < — Mapping{E{e), DOFxn)-, 

6 ; royjsi^e ^.) — Ind 

row (•) 5 

^colsip^ •) — .^^deo;(.), 

8 : end for 

9 : K = sparse{Krows,Kcois,Kyais); 


Several methods have been proposed to parallelize the global sparse matrix 
construction. In this work, instead of viewing the matrix generation subroutine 
as a single block, it is separated in two subroutines: the numerical integration 
subroutine and the assembly subroutine, which allows the analysis looking for 
parallelism opportunities. The numerical integration subroutine has several 
opportunities for parallelism m- 

1. Computing all Gauss integration points in parallel; 

2. Parallelizing each fee component [151116j : 

3. Finally, it is also possible compute fee in parallel for all elements [uns]. 

The loop over integration points (first approach) is a good strategy for higher- 
order FEs but it is harder to parallelize due to inherent data rac^ in updating 
entries in fee with contributions from different integration points, and the degree 

^The race condition or race hazard is the behavior of a system in which the output is 
dependent on a sequence of events. It becomes a bug when events do not happen in the 
proper order. Race conditions can occur in computer software, especially multithreaded or 
distributed programs. 
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of concurrency is generally lower m- The second approach is also helpful for 
higher-order FEs, but not convenient for low-order elements. The parallelization 
of the loop over low-order elements (third approach) is the most natural way 
because a thread can be charged entirely of one local matrix fcg [mill- Hence, 
the third approach is selected in our implementation since the 8-node brick FE 
offers low degree of parallelism using the first two approaches. 

In the computation of fee there are no dependences and just the information 
from the specific 8-node brick element is necessary. As a consequence, numerical 
integration for different elements are also perfectly parallelizable [15] . A thread 
is used here in to compute each local matrix feg. Then, with a one-dimensional 
grid of thread blocks, all FEs in the mesh can be computed. 

In the eight-node brick element for scalar problems (1 DOF per node), each 
fee is a square dense matrix having 64 (8 x 8) components, but exploiting the 
matrix symmetry, only the lower or upper triangular matrix part, including its 
diagonal, should be considered. This leads to a significant saving in floating 
point operations and memory requirements, since approximately half the lo¬ 
cal matrix components are needed, specifically, 36 components instead of 64. 
Additionally, there are calculations that need to be made only once for the 
same FE type, such as the shape function and their derivatives in the physical 
coordinates. We use the same EE type to discretize the whole domain. 

Algorithm [^presents the CUBA kernel to compute the local stiffness matrix 
fee. A thread is responsible of computing the 36 components of fee and store 
them in the GPU global memory as a vector in array Kyais- For each inte¬ 
gration point, each thread must serially perform several operations, such as the 
Jacobian matrix (J) computation, its determinant (|«7|) and its inverse (U~^), 
in addition to the gradient matrix (B) and the local stiffness matrix (fee). The 
shape functions (TV) and their derivatives with respect to natural coordinates 
(dN) evaluated at each integration point are computed only once and they are 
stored in a fast GPU memory. The computation of matrix J requires three 
dense matrix-vector multiplications, i.e. the multiplications of dN (3 x 8) by 
nodal coordinates in Gartesian coordinates (8x3). The computation of \J\ is 
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performed with the determinant formula of a 3 x 3 matrix and J~^ with the 
formula to invert matrices of the same size. Matrix B is computed as the multi¬ 
plication of J~^ (3 X 3) by dN (3x8). Finally, element matrix is computed 
as presented in Eq. Q with some modifications to calculate only the lower 
triangular matrix components. 

Algorithm 2 CUBA kernel for numerical integration in the 8-node brick ele¬ 
ment_ 

1: __globaL_ void NumericalIntegration{ke, E, C, MP) 

2: tid = blockDim.x * blockidx.x -\- threadidx.x] // Thread ID 
3: // Initialize local variables 
4: if tid < nel then 

5: for i = 0;i<8;i-|--f do / / Loop over integration points 

6: read N{i) and dN(i); 

7: compute J, |J| and 

8: compute B] 

9: compute the lower triangular part of fee and store it in Kyais] 

10: end for 

11: end if 


However, if the FE mesh is denser, the numerical integration may not be 
executed because this is a heavily memory-consuming operation and the GPU 
memory cannot allocate it entirely m- Thus, a simple domain partitioning 
based on the GPU memory availability is used, which is a modified implemen¬ 
tation of that proposed by Komatitsch et al. m- Gonsequently, the numerical 
integration is developed in subgroups of EEs {GPUg), i.e. the numerical in¬ 
tegration kernel is invocated to operate a FEs set. The elements subgroups 
are executed sequentially by the GPU, but the numerical integration over EEs 
within a group are executed in parallel. With this approach, the construction 
of the global EE matrix can be made from any mesh size, only limited by the 
GPU memory. Thus, if the available GPU memory is GPUma and the required 
memory by the numerical integration subroutine is GPUmr (known through 
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GPU query functions), the work done by the GPU is divided into groups based 
on the following formula: 



( 10 ) 


Each group of FEs is integrated numerically in the GPU, and the results 
allocated in the global GPU memory are downloaded to the GPU memory, 
leaving the GPU memory free for a new group of FEs. Every kernel is uploaded 
with only the data required by the FEs belonging the group that will be executed 
(only part of arrays C and E), which allows a better use of memory. 

After the numerical integration over all elements, the assembly of all local 
matrices into the global matrix is executed. To do this, the assembly step 
can be viewed as a sparse matrix format conversion [3121]. As stated before, 
the global position of each component of the Kaii matrix should be stored in 
arrays Krows and K^ois^ which is an implicit sparse matrix format storage 
[3. The assembly step consists in converting the triplet format into the GSC 
format. This function requires sorting and reducing operations, which are not 
well suited for parallel computing due to they are intrinsic serial operations. In 
fact, the assembly cannot be considered an embarrassingly parallel algorithm 
with no dependencies, as is considered the numerical integration is |15j . The 
dependencies present in the assembly algorithm can cause race condition, which 
should be avoided to prevent wrong results. 

To remove dependencies present in the assembly algorithm there are some 
options, but the best for GPU programming is the use of coloring techniques or 
atomic operations [3 HD. Using the first option, sets of elements with different 
colors are obtained, with two elements having the same color if their local stiff¬ 
ness matrices do not coincide at any global matrix entry. Given this element 
partition, the numerical integration algorithm followed by immediate matrix as¬ 
sembly can be executed in parallel for elements within a group of the same color 
and proceeds color by color sequentially. However, this option requires pre- 
computing operations to divide the mesh into colors, where the problem to find 
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the optimal number of colors is NP-hard. Although there are many heuristics 
for finding nearly minimal colorings |26j , this operation can take substantial ex¬ 
ecution time in meshes with a large number of FEs. On the other hand, atomic 
operations are undesirable because the penalty on the GPU execution time can 
be high [T7] . 

Dziekonski et al. [7] use the sparse matrix conversion concept, which is exe¬ 
cuted in the GPU with the advantage that the assembly occurs in parallel using 
atomic operations. Using this approach, arrays Krows and Kcois must be com¬ 
puted and stored in GPU, which substantially increases memory requirements. 
Although this approach is good on high-end GPUs provided with high-memory 
capacity to avoid memory transactions between processors, it is less interesting 
for many low-end video cards users. Additionally, in many current personal 
computers, the memory installed in the GPU motherboard is often four to eight 
times as large as than that installed on the graphics card m- Hence, a good 
option is to make the sparse matrix format conversion in the GPU. 

Therefore, a methodology for using GPU for intrinsic parallel and GPU for 
intrinsic serial algorithms in the global FE matrix construction is proposed. 
The components of Kyais are parallel computed in GPU, but indices Krows 
and Kcois are computed and stored in the GPU, since their computation is 
simple [2l]. After computing the global FE matrix in triplet format in GPU 
and GPU, as described above, the sparse matrix format conversion is executed 
in MATLAB using one of the following options: the native MATLAB function 
spars^ the sparse2 function from SuiteSpars^ and the sparse^create function 
from MILAMIl^ A graph showing the flow of our proposed implementation is 
presented in Fig. 

In Fig. there is a preprocessing phase that can be performed by any 
specialized software for complex domains. However, as our goal is evaluating 

®MATLAB R2013b, http://www.mathworks.com/help/matlab/ref/sparse.html 
^SuiteSparse 4.0.2, http://www.cise.ufl.edu/research/sparse/SuiteSparse/ 

®Mutils 0.3, http://www.milamin.org/ 
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the computational time involved in the matrix generation, we use cubic domains 
discretized by 8-node brick elements. This simple discretization is performed in 
MATLAB generating three arrays: connectivity matrix E, which indicates the 
nodes that compose an element; nodal matrix coordinates C, which specify the 
Cartesian coordinates of the nodes; and material properties matrix MP, which 
states the material properties of each element. As this subroutine is simple and 
takes a few seconds to generate large meshes, it is executed in the CPU. 

The second phase in Fig. shows the necessary steps for building the global 
FE matrix, as described previously. In this step, the heterogeneous computing 
concept is used: while the GPU is executing the numerical integration kernel, 
the CPU is used for generating the indices of the sparse matrix in triplet format. 
Once the local matrices from all elements are computed, the assembly function 
can be executed, and the global matrix is obtained. After the global system is 
obtained, the boundary conditions are applied, and the solution of the linear 
system is found, but these steps are not shown in the figure because it is out of 
the scope of this paper. 


Time 



Figure 2: Global sparse FE matrix construction. 


4. Results 

The GPU results are obtained using a 2GB NVIDIA Quadro 4000 video 
card based on Fermi architecture. The CPU computations were performed in 
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MATLAB R2013b using an Intel Xeon E5-2620 CPU at 2.0 GHz with six cores 
in a 64-bit Windows 7 operating system with 40GB of RAM. However, in the 
serial implementation only one CPU core is used. As stated before, the matrix 
is generated in two steps: first, conducting the numerical integration (NI) over 
all FEs, and second, assembling all local matrices into a global matrix. The 
results are presented separately in the following subsections. 

^.1. Numerical integration 

The NI subroutine is programmed in CUBA C and is called from MAT- 
LAB workspace using the parallel.gpu. CUDAKernel constructor, which creates 
a CUDAKernel object. The properties of the CUDAKernel object such as the 
number of blocks in the grid and the number of threads in the block are also 
configured in MATLAB. After the creation and configuration of the kernel, it 
is evaluated using the feval MATLAB function. 

The code executed in GPU is compared with its respective serial implemen¬ 
tation counterpart executed in one GPU core. Fig. shows the computational 
time when the NI routine is executed in GPU and GPU for different meshes. In 
the time spent by the GPU, the time spent in memory transactions is also taking 
into account: uploading data from GPU to GPU memory using the MATLAB 
function gpuArray^ and vice versa, downloading data from GPU to GPU mem¬ 
ory using the function gather. The gather function guarantees that all work is 
finished in the GPU, and it ensures accurate time measurement using the tic-toc 
MATLAB functions. Fig. [^also shows the speedup reached by the GPU imple¬ 
mentation. The GPU speedup is calculated as the ratio between the serial (tg) 
and parallel (tp) execution time. The advantage of the GPU implementation is 
observed in the curves presented. 

The NI routine can compute meshes with almost five million of 3D FEs in a 
GPU with-low-memory capacity without using mesh partitioning. Although the 
NI routine is considered memory bounded the implemented code consumes 
little memory due to the low-order of the FE selected. The maximum time 
spent in the GPU for the largest mesh is 7.56 seconds, while for the same mesh 
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Figure 3: Execution time in the numerical integration subroutine and GPU speedup as a 
function of the mesh size. 

in serial computing in the CPU, the NI routine spent 875.1 seconds, which 
means a GPU speedup of 115.8X. The GPU speedup dramatically increases 
its value using 8000 FEs. After that, the GPU speedup rate is slow; indeed, 
with more than four million FEs, the GPU speedup decreases. This fact is 
explained by the ratio between floating point operations (flops) and memory 
operations (memops). When the data volume transferred from one to another 
memory type (memops) is larger than the flops; the problem thus becomes 
memory bounded and the speedup is decreased. This is known as memory wall, 
in which there is a difference between the speed at which data are transferred to 
the processor and the rate at which instructions are executed m . A common 
problem in transferring data between the CPU and the GPU memories is found 
in the PCI-express port, which limits the rate of moving data between these 
processors. Eor this reason, minimizing the transaction between the processor 
memories is recommended. 
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4-2. Assembly 

The assembly step is performed in the CPU for two reasons: the assem¬ 
bly algorithm presents few parallel opportunities and requires high temporary 
memory. As the CPU cores are designed to maximize the execution speed of 
sequential programs (latencj0 [TU] , this processor is well suited for serial tasks, 
as proposed here (assembly). The memory requirements of the assembly routine 
are shown in Table [TJ 

As indicated in Section the assembly phase can be viewed as a sparse 
matrix conversion from triplet to CSC format. Tableshows the percentage of 
compression in the non-zero elements (NNZ) going from one to another sparse 
format through different meshes. This table also shows the percentage of mem¬ 
ory saving when the CSC is used instead of the triplet form. The percentage 
of compression and memory saving are not equal due to the matrix structure, 
which depends on the nodes and elements numeration; however, these percent¬ 
ages seem to be very similar. 

Having into account that the assembly subroutine initially requires a lot of 
memory for storing the three arrays Krows, Kcois and Kyais, it is not feasible 
for a GPU implementation, particularly if GPUs with low-memory capacity are 
used for forming large matrices. 

Three different routines for changing the sparse matrix format are considered 
herein: the sparse function from MATLAB (Sparse), the sparse2 function from 
SuiteSparse (Sparse2) and the sparse-create function from MILAMIN (Spar- 
seCreate). The execution time spent in the NI routine, executed in the GPU, 
and the assembly routine, executed in the GPU, are presented in Fig. Addi- 

®Latency, in a basic definition, is the amount of time to complete a task, which mean that 
it is measured in units of time. While the CPU tries to minimize the latency, the GPU tries to 
maximize the throughput. Throughput is tasks completed per unit time measured in units as 
stuff per time, such as jobs completed per seconds. Therefore, the NI subroutine is executed in 
the GPU because we need to obtain many local matrices in the unit of time; and the assembly 
subroutine is executed in the CPU because the whole task is required to be finished in a short 
time. 
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Table 1: Memory requirements in the triplet and CSC sparse formats for different mesh 
discretizations. 


# of FEs 1 

1000 

8 000 

64000 

512000 

1728000 

2 744000 

4096 000 

5 832000 

8000000 

NNZ in Triplet 

36000 

288000 

2 304000 

18432000 

62 208 000 

98 784000 

147456000 

209 952000 

288000000 

NNZ in CSC 

15 561 

118121 

920 241 

7264481 

24408 721 

38 710841 

57728961 

82135081 

112 601201 

NNZ compression 

' 56.8% 

59.0% 

60.1% 

60,6% 

60,8% 

60.8% 

60,9% 

60.9% 

60.9% 

Memory used (triplet) 

' 0.58 MB 

4.61 MB 

36.9 MB 

294.9 MB 

995.3 MB 

1 580.5 MB 

2 359.3 MB 

3 359.2 MB 

4608.0 MB 

Memory used (CSC) 

0.26 MB 

1.96 MB 

15,3 MB 

120.5 MB 

404.7 MB 

641,8 MB 

957.0 MB 

1361.6 MB 

1 866.6 MB 

Memory saving 

' 54.9% 

57.4% 

58.6% 

59.1% 

59.3% 

59.4% 

59,4% 

59.5% 

59.5% 


tionally, this figure shows the time spent for computing arrays Krows and K^ois 
(Index) used by Sparse and Sparse2 routines. The NI routine spends more time 
than Sparse2 and SparseCreate routines for all mesh discretizations, even though 
the NI routine is executed in parallel in the GPU and the assembly routines are 
executed serially in the CPU. This occurs because the NI is an intensive routine 
that requires many operations between small matrices, such as multiplications, 
additions and inversions, whereas assembly only requires sorting and reducing 
operations [7]. 
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Figure 4: Execution time spent by the assembly using three functions and different mesh sizes. 


The sparse MATLAB function consumes significantly more time and mem- 
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ory than the other assembly functions, and for meshes larger than 200 thousand 
NNZs, it requires more time than the NI routine. The large memory usage 
explains the drawbacks of this function. In this routine, the row and column 
indices must be entered as double type arrays, which increases the memory re¬ 
quirements and performance overhead making this function the most inefficient 
assembly routine [24j . In turn, although the sparse2 function has an excellent 
performance, this routine requires computing and storing the index arrays, and 
their computations require almost the same time as the assembly function itself, 
meaning that the computational time spent by sparse2 can be duplicated if the 
time for computing Kj-ows and K^ois is considered. 

Regarding the sparsc-create function for the sparse matrix conversion, in 
all discretization cases, the computational time invested in assembly is always 
the lowest. Indeed, this function employs less time than the index arrays effi¬ 
ciently computed as MILAMIN [21]. Moreover, the sparse-Create function does 
not require the computation of the index arrays for the assembly step, saving 
memory and computations. However, this function is a specialized routine to 
form only matrices arising in FEM, which can represent a disadvantage when 
general sparse matrices are required. 

4-3. Partitioning the numerical integration 

Finally, as shown above, the numerical integration executed in parallel by us¬ 
ing GPU and assembling the NNZ element based on sparsc-create CPU function 
are the most efficient implementation. However, without doing a mesh parti¬ 
tioning for the NI routine, the problem size that can be solved using low-memory 
capacity GPUs is limited. Hence, using an approach as the one presented in 
Section [3] for partitioning the mesh based on available GPU memory, FE ma¬ 
trices can be constructed as large as CPU memory allows. In addition, as the 
CPU memory is usually greater than the GPU memory, larger matrices can be 
constructed than those created only in the GPU. 

Fig. [^shows the computational time for numerical integration and assembly 
routines in the construction of the global sparse matrix, considering different 
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mesh sizes. Adding the time spent by these two routines, the time necessary to 
construct the global sparse matrix (MatGen) is obtained. Table [^presents these 
results for different mesh sizes and the time percentage employed per function. 
The largest mesh size constructed has 27.27 million nodes and 27 million 3D 
elements. As stated before, the numerical integration routine always consumes 
more time than the assembly. Specifically, the assembly function consumes less 
than the 40% of the total matrix generation time for all the mesh cases: 37.9% is 
the greatest percentage spent for assembling 1 728 million FEs (an intermediate 
mesh size) and the lowest percentage was presented assembling 1 000 FEs (the 
lowest mesh size). These results indicate an appropriate approach to generate 
large EE matrices from unstructured meshes in systems in which the GPU 
memory is lower than the CPU memory. Additionally, this method shows the 
benefits of heterogeneous computing: using the CPU for intrinsic serial routines 
and the GPU for intrinsic parallel and intensive computations. 



Size of the global system formed 


Figure 5: Execution time spent generating global sparse FE matrices of several sizes. 

As a final comment, for reducing the time for matrix generation, one option 
is to form temporally small sparse matrices and next, to assemble all these 
matrices for forming the global matrix. Since the numerical integration for large 
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Table 2: Computational time spent in the sparse matrix construction. 


# of FEh 1 

1000 

8 000 

64000 

512000 

1728 000 

2 744000 

4096 000 

5 832000 

8000 000 

15625000 

27000000 

Matrix size 

1331 

9 261 

68921 

531441 

1771 561 

2 803 221 

4173 281 

5 929 741 

8120 601 

15813 251 

27270901 

MatGen time (s) 

0,02 

0,03 

0.15 

1.17 

3.98 

5,92 

8.85 

15.3 

23.8 

40.9 

115.0 

NI time 

' 82.8% 

80.9% 

68.4% 

62.9% 

62.1% 

66,2% 

66.2% 

70,9% 

67.8% 

70.5% 

82.4% 

Assembly time 

' 17.2% 

19.1% 

31.6% 

37.1% 

37.9% 

33.8% 

33.8% 

29,1% 

32.2% 

29.5% 

17.6% 


meshes is divided into groups, which are executed sequentially in the GPU, it 
can can execute kernels asynchronousljQ an approach that can accelerate large 
sparse matrix generation in FEM is assembling all local matrices belonging to 
a FEs group while the GPU is executing the NI routine. 

5. Conclusions 

The implementation proposed shows that large global sparse matrices arising 
in the numerical solution of EPDEs by FEM can be generated in heterogeneous 
computing with limited computational resources. In fact, with only 2GB of 
memory in the GPU, FEM matrices of 3D unstructured meshes with up to 
27 million of FEs are generated. Moreover, a significant acceleration in the 
numerical integration executed in the GPU as compared to the serial CPU im¬ 
plementation is achieved. The maximum speedup reached by the GPU is 126x, 
and the mean value is 121x, which shows good performance of the developed 
code. Additionally, the assembly function {sparse-create) used in this work is 
an efficient routine that requires lower runtime and memory consumption com¬ 
pared with the other two functions {sparse and sparse2). In fact, sparse-create 
consumes less runtime than the numerical integration routine, even though the 
latter is implemented in parallel computing. The maximum time spent by the 
sparsc-create function took only twenty seconds for assembling 27 million FEs, 

^Some function calls are asynchronous to facilitate concurrent execution, returning the 
control to the CPU thread before the GPU has completed a specific task. This means that 
the GPU is able to execute several kernels while the CPU executes other routines. 
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one-fifth of the time spent by NI routine. 

Finally, although the proposed methodology is applied to the numerical so¬ 
lution of the Poisson equation by using FEM, this approach is general and can 
be used to construct not only the stiffness matrix from static or stationary prob¬ 
lems, but also the mass matrix from dynamics problems and for solving other 
types of PDEs by using the FEM. 
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